3.2 Basic NLP and Text Checks
Before beginning the natural language processing analysis, we ran a series of basic text checks on the dataset.
3.2.1 Distribution of Text Length
Using a user-defined function to compute the length of each document in words, we analyzed the distribution of text length across submissions and comments. On average, Dogecoin-related posts contain approximately 110 words, comments contain 11.9 words, and titles contain 9.3 words.
Table 1. Summary statistics of post, comment, and title lengths (in words)
Text | Average | Maximum | Minimum |
Posts | 109.98 | 5532 | 1 |
Comments | 11.9 | 1404 | 1 |
Titles | 9.3 | 80 | 1 |
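The length computation behind Table 1 can be sketched as follows. This is a minimal illustration, not the function used in the analysis: it assumes length is measured in whitespace-separated words, and the names `text_length` and `summarize_lengths` as well as the sample posts are hypothetical.

```python
from statistics import mean


def text_length(text):
    """Return the number of whitespace-separated words in a document."""
    if not text:
        return 0
    return len(text.split())


def summarize_lengths(documents):
    """Return the average, maximum, and minimum word counts of a list of texts."""
    lengths = [text_length(doc) for doc in documents]
    return {
        "average": mean(lengths),
        "maximum": max(lengths),
        "minimum": min(lengths),
    }


# Hypothetical sample documents, not taken from the dataset.
sample_posts = ["to the moon", "Is now a good time to buy", "hodl"]
print(summarize_lengths(sample_posts))
```

Applying the same summary separately to posts, comments, and titles would yield one table row per text type.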
The following histogram plots the distribution of text lengths in posts and comments, colored by the subreddit in which each was posted. It shows a striking contrast: most very short posts appear in r/dogecoin, while r/CryptoCurrency posts fall toward the longer end of the distribution. This might indicate that r/CryptoCurrency has higher-quality or higher-information posts than r/dogecoin.
Figure 1. Histogram of post lengths for both subreddits
3.2.2 Frequent Words
By splitting the cleaned output of our pipeline into individual words, we counted the most frequently occurring words in both submissions and comments.
Figure 2. Top 10 most frequently used words
Note: Based on a sample.
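The word count described in this subsection can be sketched as follows. This is only an illustration: it assumes the cleaned text is already lowercased and stripped of stop words and punctuation, so that splitting on whitespace is enough. The function name `top_words` and the sample documents are hypothetical.

```python
from collections import Counter


def top_words(cleaned_documents, n=10):
    """Return the n most common words across already-cleaned documents."""
    counts = Counter()
    for doc in cleaned_documents:
        # Each cleaned document is assumed to be a whitespace-separated string.
        counts.update(doc.split())
    return counts.most_common(n)


# Hypothetical cleaned documents, not taken from the dataset.
sample_docs = ["doge moon doge", "buy doge", "moon soon"]
print(top_words(sample_docs, n=2))
```

Running the function once on submissions and once on comments gives the two rankings that a figure like Figure 2 would plot.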
3.2.3 URLs
Using a regex-based search for URLs, we find that a higher percentage of posts and comments in r/CryptoCurrency contain URLs than those in r/dogecoin. This is consistent with, but does not confirm, our hypothesis that posts in r/dogecoin may contain lower-quality information, or fewer citations and external links to support their claims.
Figure 3. Bar graph of share of posts containing URLs
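The URL check can be sketched as follows. The exact regular expression used in the analysis is not given, so the pattern below is an assumption that matches text starting with `http://`, `https://`, or `www.`. The names `URL_PATTERN`, `contains_url`, and `url_share` and the sample texts are hypothetical.

```python
import re

# Assumed pattern: a scheme-prefixed or "www."-prefixed run of non-space characters.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+", re.IGNORECASE)


def contains_url(text):
    """Return True if the text contains at least one URL-like substring."""
    return bool(text) and URL_PATTERN.search(text) is not None


def url_share(texts):
    """Return the fraction of texts that contain at least one URL."""
    texts = list(texts)
    if not texts:
        return 0.0
    return sum(contains_url(t) for t in texts) / len(texts)


# Hypothetical sample texts, not taken from the dataset.
sample_texts = [
    "see https://example.com for details",
    "no link here",
    "charts at www.example.org",
    "to the moon",
]
print(url_share(sample_texts))
```

Computing `url_share` per subreddit gives the shares compared in the bar graph. A simple pattern like this one misses bare domains such as `example.com`, so it may undercount links.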